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Abstract 

The goal of branch length estimation in phylogenetic inference is to estimate the divergence 
time between a set of sequences based on compositional differences between them. A number 
of software is currently available faciUtating branch lengths estimation for homogeneous and 
stationary evolutionary models. Homogeneity of the evolutionary process imposes fixed rates 
of evolution throughout the tree. In complex data problems this assumption is likely to put the 
results of the analyses in question. 

In this work we propose an algorithm for parameter and branch lengths inference in the 
discrete-time Markov processes on trees. This broad class of nonhomogeneous models comprises 
the general Markov model and all its submodels, including both stationary and nonstationary 
models. Here, we adapted the well-known Expectation-Maximization algorithm and present a 
detailed performance study of this approach for a selection of nonhomogeneous evolutionary 
models. We conducted an extensive performance assessment on multiple sequence alignments 
simulated under a variety of settings. We demonstrated high accuracy of the tool in parameter 
estimation and branch lengths recovery, proving the method to be a valuable tool for phylogenetic 
inference in real Ufe problems. Empar is an open-source C++ implementation of the methods 
introduced in this paper and is the first tool designed to handle nonhomogeneous data. 
Keywords: nucleotide substitution models; branch lengths; maximum-hkelihood; expectation-maximization 
algorithm. 
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Assuming that an evolutionary process can be represented in a phylogenetic tree, the tips of the tree 
are assigned operational taxonomic units (OTUs) whose composition is known. Here, the OTUs 
are thought of as the DNA sequences of either a single or distinct taxa. Internal vertices represent 
ancestral sequences and inferring the branch lengths of the tree provides information about the 
speciation time. 

Choice of the evolutionary model and the method of inference have a direct impact on the 
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and Halpeml |1999[ [Lockhart etalj |1994[ [Huelsenbeck and Hillis[ |1993^ [Schwartz and Mueller 



2010p . We assume that the sites of a multiple sequence alignment (MSA) are independent and 
identically distributed (i.i.d. hypothesis of all sites undergoing the same process without an effect 
on each other), the evolution of a set of OTUs along a phylogenetic tree T can be modeled by the 
evolution of a single character under a hidden Markov process on T. 

Markovian evolutionary processes assign a conditional substitution (transition) matrix to 
every edge of T. Most current software packages are based on the continuous-time Markov 
processes where the transition matrix associated to an edge e is given in the form exp(2'^?e), where 
Q^e is an instantaneous mutation rate matrix. Although in some cases the rate matrices are allowed 
to vary between different lineages (cf. Galtier and Gouy ( 1998[ ) jYang and Yoder (1999])), it is not 
uncommon to equate them to a homogeneous rate matrix 2, which is constant for every lineage in 

T. 

Relaxing the homogeneity assumption is an important step towards increased reliability 



of inference (see Eo and DeWoody (2010)). In this work, we consider a class of processes more 
general than the homogeneous ones: the discrete-time Markov processes. If x is rooted, these 
models are given by a root distribution and a set of transition matrices A'^ (e.g. chap. 8 of 
Semple and Steel ( 2003| )). The transition matrices can freely vary for distinct edges and are not 
assumed to be of exponential form, thus are highly applicable in the analyses of non-homogeneous 
data. Among these models we find the general Markov model (GMM) and all its submodels, e.g. 



discrete-time versions of the Jukes-Cantor model (denoted as JC69*), Kimura two-parameters 
(K80*) and Kimura 3-parameters models (K81*), and the strand symmetric model SSM. Though 
the discrete-time models provide a more realistic fit to the data ( |Yang and Yoder . 1999; Ripplinger 



and Sullivan 2008 2010), their complexity requires a solid inferential framework for accurate 



parameter estimation. In continuous-time models, maximum-likelihood estimation (MLE) was 
found to outperform Bayesian methods ( [Schwartz and Mueller[|2010| ). The most popular programs 
of phylogenetic inference (PAML [^iigl ([1997]), PHYLIP |Felsenstein| ([1993]), PAUP* [Swofford 
( |2003 )) are restricted to the homogeneous models. 

Though more realistic, the use of nonhomogeneous models in phylogenetic inference is not 
yet an established practice. 

Recently, Jayaswal et al. ( 2011[ ) proposed two new non-homogeneous models. With the 
objective of testing stationarity, homogeneity and inferring the proportion of invariable sites, the 
authors propose an iterative procedure based on the Expectation Maximization (EM) algorithm to 
estimate parameters of the non-homogeneous models (cf. Barry and Hartigan ( 1987a[ )). The EM 
algorithm was formally introduced by Dempster et al. ( \911) (cf. Hartley (1958])). It is a popular 
tool to handle incomplete data problems or problems that can be posed as such (e.g. missing 
data problems, models with latent variables, mixture or cluster learning). This iterative procedure 
globally optimizes all the parameters conditional on the estimates of the hidden data and computes 
the maximum likelihood estimate in the scenarios, where, unlike in the fully-observed model, 
the analytic solution to the likelihood equations are rendered intractable. An exhaustive list of 



references and applications can be found in Tanner ( 1996 1, and more recently in Ambroise ( 1998 1. 



Here, we extend on the work of Jayaswal et al. (2011 1 and present Empar, a MLE method 
based on the EM algorithm which allows for estimating the parameters of the (discrete-time) Markov 
evolutionary models. Empar is an implementation suitable for phylogenetic trees on any number 
of leaves and currently includes the following evolutionary models: JC69*, K80*, K81*, SSM and 
GMM. 

We test the proposed method on simulated data and analyze the accuracy of the parameter 
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and branch length recovery. The tests are conducted in a settings analogue to that of ISchwartz and 



Mueller (2010) and evaluate the performance of Empar on the four and six-taxon trees with several 



sets of branch lengths, JC69* and K81* models under varying alignment lengths. We present an 
in-depth theoretical study, investigating the dependence of the performance on factors such as 
model complexity, size of the tree, positioning of the branches, data and total tree lengths. 

Our findings suggest that the method is a reliable tool for parameter inference of small sets 
of taxa, best results obtained for shorter branches. 

The algorithm underlying Empar was implemented in C++ and is freely available to download 
at |http://genome.crg.es/cgi-bin/phylo_modjsel /AlgEmpar."pI[ 



METHODS 

Models 

We fix a set of n taxa labeling the leaves of a rooted tree T. We denote by A'^(t) the set of all nodes of 
T, the set of leaves as L(t), the set of interior nodes as Int{x), and the set of edges as £'(t). We are 
given a DNA multiple sequence alignment (MSA) associated to the taxa in T and a discrete-time 
Markov process on x associated to an evolutionary by a model ^ , where the nodes inr are discrete 
random variables with values in the set of nucleotides {A, C,G,T}. We assume that all sites in the 
alignment are i.i.d. and model evolution per site as follows: for each edge e of T we collect the 
conditional probabilities P{y\x,e) (nucleotide x being replaced by y at the descendant node of e) 
in a transition matrix A*^ = {P{y\x,e))xy, TT = (^Ta, TTc, tTq, %) is the distribution of nucleotides at 
the root r of T and ^ = {n, {A^)e} the set of continuous parameters of ^ on T. We denote by X 
the set of 4" possible patterns at the leaves and Y the set of 4l^'''^(^)l possible patterns at the interior 
nodes of T. In what follows, the joint probability of observing x = G X at the leaves and 
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nucleotides y = {yv)veint{T) G 7 at the interior nodes in T is calculated as 



n 



yan(v) 



where an{y) denotes a parent node of node v, ean(v),v edge from an{v) to v (note: if v is a leaf, 
then =Xv). 

In the complete model the states at the interior nodes are observed and the joint distribution 
is computed as above. On the other hand, the observed model assumes the variables at the interior 
nodes to be latent. In the latter case, the probability of observing x = (-<:/) /eL(T) at the leaves of T 
can be expressed as 

y=(yv)veint(x)^v 

Restricting the shape of the transition matrices leads to different evolutionary models 



such as JC69*, K80*, K81*, SSM (see Casanellas et al.|(2011), AUman and Rhodes (2004), and 



AUman and Rhodes (2007) for references and background on the discrete-time models). The first 



three are the discrete-time versions of the widely used continuous-time JC69 (Jukes and Cantor 



1969), K80 (Kimura, 1980) and K81 (Kimura, 1981) models. The Strand Symmetric model SSM 



(Casanellas and SuUivant (2005)) is a discrete-time generalization to the HKY model (Hasegawa 



et al. 1985 ) with equal distribution of the pairs of bases (A , T) and (C , G) at each node of the tree. 
It reflects the double- stranded nature of DNA and was found to be well-suited for long stretches of 



data ( |Yap and Pachter[[2004| ). Lastly, the general Markov model GMM ( [AUman and Rhodes] ( |2003| ); 
Steel et al. (1994])) is free of restrictions on the entries of A*^, non-stationary, and can be thought as 



a non-homogeneous version of the general time reversible model (Tavare ( 1986 )). 



Expectation-Maximization algorithm 

An algebraic approach to the Expectation-Maximization (EM) algorithm was first introduced in 
Pachter and Sturmfelsl ( |2005[ ). In this work, we adapted this approach to the context of phylogenetic 



6 



trees. 

Let D denote a MSA recorded into a vector of 4l^('^)l counts of patterns ud = (Mx)xex, 
where each Wx stands for the counts of a particular configuration of nucleotides x at the leaves, 
observed as columns in the alignment. We are interested in maximizing the likelihood function: 

x<EX 

(up to a constant). Let Ucd = {ux.y)xex.yeY be an array of counts for the complete model, where 
Mx,y is the number of times x was observed at the leaves and y at the interior nodes. The likelihood 
for the complete model has a multinomial form 

xeX,yeY xeX.yeY veN{T)\{r} 

(up to a constant), which is guaranteed to have a global maximum given by a model-specific 
explicit formula (see Supp. mat. A). 

EM algorithm iteratively alternates between the expectation (E-step) and maximization step 
(M-step). In the E-step the algorithm uses the tree topology, current estimates of parameters and 
the observed data ud to assign a posterior probability to each of the possible 4l^(^)l patterns in X 
and the expected counts of the complete model, Ucd- This step can be efficiently performed using 



the peeling algorithm of Felsenstein (2004). In the M-step the updated MLE of the parameters are 
obtained by maximizing the likelihood of the complete model ([T|). The procedure is depicted in 
Fig. 1. 



The likelihood is guaranteed to increase at each iteration of this process (e.g. Wu ( 1983), 



Husmeier et al. (2005)). Moreover, for a compact set of parameters the algorithm converges to a 
critical point of the likelihood function. Although the output of the algorithm is not guaranteed to 
be a global maximum, multiple starting points are used for optimal solution. 
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Statistical tests 

The substitution matrices are assumed stochastic. The number d of free parameters for transition 
matrices in JC69*, K80*, K81*, SSM and GMM models is 1, 2, 3, 6, and 12 respectively. The root 
distribution under the SSM (respectively GMM) model has 2 (resp. 3) free parameters, and is uniform 
for the remaining models considered here. For clarity of exposition, hereon the reference to the 
root distribution will be omitted; however, the formulas can be easily modified to include the root. 

We let be the vector of free parameters defined as the off-diagonal elements of a transition 
matrix A associated to a edge e {^^ = Ai 2, i§| is the next -from left to right, top down- off-diagonal 
entry distinct from ^f, etc). The procedure is repeated until i^J is reached. 

Let ^ = ((^f ),=i d-eeE{r) denote the vector of free parameters for an evolutionary model 



^ as above and let ^ be its MLE. Under certain regularity conditions (Zacks, 1971 , Chap. 5), ^ 
exists, is consistent, efficient and asymptotically normal with mean ^ and the covariance matrix 
given by the inverse of the Fisher information matrix (Rao 1973 Efron and Hinkley]|1978| ). The 



entries of the ^[^(t) | x d\E(T) \ Fisher information matrix I over free parameters are given by: 

l{^,A') = -^\-^^^^^^ (2) 

(see Supp. mat. B for details). The Wald statistics for testing the null hypothesis = |f , e E 
E{x),i= 1,..., J, is 

(f-^^)^F(f-^^)~;tJ, (3) 

where Y denotes the d x d slice of I corresponding to the free parameters of e E E{x). The 
p — value can thus be easily calculated by looking at the tails of the corresponding distribution. 

We tested the validity of the test statistics in our data by simulating a variety of MS As under 
the complete model and compared it to the theoretical distribution ([3]). Figure I in the Supp. mat. 
C shows high fit and proves that the setting is appropriate. 

Variances of the free parameters of the model and the full (observed) covariance matrix are 
saved in the output of Empar. These in turn can be used as the plug-in estimators in (|3]) to calculate 
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the p — values and normal confidence intervals for the parameters. 

We denote by Vf^ the diagonal entry of the matrix corresponding to the variance 

of the free parameter ^f,i= 1 , . . . , J. For the models with d > I (i.e. all but JC69*), the variances 
of the free parameters can be summarized in a combined form cV^ for each edge e: 

cV'i^') = ^ ^. (4) 



Branch lengths 

The evolutionary distance between two nodes in T joined by an edge e with substitution matrix 
is defined as the total number of substitutions per site along e. This quantity is referred to as the 
branch length of edge e (or of matrix A^) and, following on ( Barry and Hartigan]|1987b ), can be 
approximated by: 

/(A^) = -ilogdet(A^). (5) 

We denote the total length of the tree T by L^, = Eee|£(T)| ^(^^)- 

Now, let A and A' be two invertible 4x4 matrices such that the entries of {A')^^{A — A') 
are small. From ([5]), we get 

\m-l{A')\ = i|log^| = i|logdet((Ar^A)| 

= ^|logdet(Id + (A')-i(A-A'))| 

^ \\\og{\ + Tr{{A')-\A-A')))\ 

^ \\Tr{{A')-\A-A'))\ < \A\\{Ar\A'-A))\\, 

< ||(A')-i||i||A-A'||i, (6) 

where 1 1 . 1 1 1 is the maximum absolute column sum of the matrix. Therefore if A' is a good approximation 
to A, then Z(A') is a good approximation to /(A). In what follows, we use the statistical test above 
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to show the accurate recovery of the parameters. By the above argument, we can conclude that the 
estimates of the branch lengths will also be accurate (see also Results section). 



Simulated data 

Performance assessment of Empar was conducted on the MSAs simulated on four and six-taxon 



trees following (Schwartz and Mueller 2010). In the case of four taxon trees we fixed an inner 
node as the root and considered three types of topologies: T^aian^ed corresponds to the "balanced" 
trees with all five branches equal; the inner branch in Ti-2 is half the length of the exterior branches; 
and T2: 1 denotes a topology with the inner branch double the length of the external ones (see Fig. 2). 
In T^aianced ^2:1 we let the length Iq of the inner branch vary from 0.01 to 1.4, where starting 
from 0.05 it increases in steps of 0.05; in Ti-2 we let Iq vary in (0,0.7). For 6-taxon trees we used 

only balanced trees T^aianced ^^^^ ^^S- -^^ ^^'•^ ^ ^ (0,0.7). 

We simulated multiple sequence alignments on trees with 4 and 6 leaves under JC69* 



and K81* models. We used the GenNon-H package available from http://genome.crg.es/cgi-bin/ 
'phylo jnod sel/AlgGenNonH.pl , In brief, based on an input phylogenetic tree with given branch 
lengths, GenNon-H samples the substitution matrices corresponding to these lengths for all edges 
and uses them to generate the DNA MSAs following discrete-time Markov process on the tree. 
The output of this software is the alignment, the substitution matrices, root distribution (whenever 
non- stationary) and the variances of the continuous free parameters. We note that for the JC69* 
model, there is a 1 — 1 correspondence between the branch length and the free parameters of the 
substitution matrix. This does not hold for other models, were different substiution matrices may 
give the same branch length. 

We set the alignment length L to 300nt,500nt,l,000nt and 10,000nr for 4-taxa and to 
l,000n? or 10,000n? for 6-taxa. For the JC69* and K81* evolutionary models, a phylogenetic tree 
T (with branch lengths), and a given alignment length, we run each analysis 1,000 times, and 
estimated the parameters using Empar. 
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All MS As used for the tests are accessible at the Empar webpage. 



Identifiability 

It is known that in certain cases the substitution parameters are not identifiable (e.g. parameters 



at the edges adjacent to the root of valency 2). As shown in AUman and Rhodes (2003 1, the GMM 



model and its submodels, are identifiable up to a permutation of rows. Zou et al. (2011 1 showed 
that incorrect order of rows in the matrices can lead to a negative determinant of the substitution 
matrix from which the branch lengths cannot be calculated. 

We expected this problem to arise in short data sets and large branch length, as those 
correspond to the substitution matrices with smaller diagonal value. For all the data sets used for 
tests, we calculated the percentage of cases among the 1 ,000 simulations for which the parameters 
estimated by the EM algorithm were permuted. This phenomenon was only observed in the data 
sets of 300nt and l.OOOnt. In the first case, the estimated matrices were permuted when the initial 
branch length was 0.55 or longer and corresponded to 0.005-0.023% of the cases; in the latter, for 
the branches of 0.6 or longer with at most 0.001% permuted matrices. Shorter branch lengths and 
longer alignments did not suffer from the above problem and recovered the underlying order in all 
of the cases. 



As shown by Chang| ( 1996 ), the entries of the Diagonal Largest in Column (DLC) substitution 



matrices are identifiable. Namely, there exists a unique set of substitution matrices satisfying the 
DLC condition and a unique root distribution that leads to a given joint distribution at the leaves. 
In order to ensure the reliability of the results we designed a procedure that scans the tree in the 
search of the permutations that maximize the number of substitution matrices with larger diagonal 
entries. It is not possible to maximize it for all edges, thus the goal is to find the permutations 
giving more weights to the lower parts of the tree, starting with the nodes corresponding to the 
outer branches. Given a tree T, we choose an interior node to be the root, directing all edges 
outwards. For each interior node x, we apply a permutation S{x) of {A, C, G, T} that maximizes the 
sum of diagonal entries of the matrices assigned to the outgoing edges of x. Permutations S{x) are 



11 



applied recursively to the subtrees of t, moving x from the outer nodes towards the root. 



RESULTS AND DISCUSSION 



We present the results on the simulated data sets and discuss their dependence on the length of the 
alignments, the length of the branches and the depth of the branches in the tree-1 for the external 



branches and 2 for the internal branches (Schwartz and Mueller 2010). In cases with multiple 



branches of equal depth, we chose one of them at random. 



Results of statistical tests 

Each sample gave rise to a p — value based on the Xd test given by (|3]). The p — values are a 
measure of strength of evidence against the null hypothesis: for both exceptionally small or large 
p — values one can reject the null hypothesis. 

We recorded the proportion of samples for which the p — value lied in the interval (0.05, 0.95) . 
The results are shown Table [T] for the JC69* model on the Xi-2 tree (also see Tab. I-V in the 
Supp. mat. C). We observe that even for short alignments of 300nt the null hypothesis cannot be 
rejected in approximately 95% of the tests. 

Error in transition matrices 

For a given branch, we quantified the divergence D between the original and estimated parameters 
of its transition matrix A using the induced Li norm: ||A — (see ([6])). The columns in the 
transition matrices of JC69*, K80*, and the K81* are equal and the norm becomes: 

4 

D = £|A,,i-A,i|. (7) 

Figure 3 depicts the results for JC69* and K81* on the three 4-taxon phylogenies, different alignment 
lengths and depths of the branches. The shapes of the distribution of D for both models are very 
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similar. As expected, the performance is weaker for long branches and short alignments. A great 
improvement is observed with the increase in the alignment length, e.g. 10,000n? depicts very 
accurate estimates. The performance under the JC69* model (Fig. 3a )is better than that of K81* 
(Fig. 3b) for shorter branch lengths. 

Parameter dispersion 

Figure 4 shows the variances of the estimated parameters for depth 1 and 2 branches on the Tbaianced^ 
'^1:2, '^2:1 trees under the JC69* model. 

The variances show an exponential increase, which is most significant in the ^^gj^jj^gfj tree 
for both depths, and the Ti-2 for depth 2. The results for the depth 1 branch in Tbaianced and Ti-2 are 
very similar. The smallest variance was observed for the depth 2 of T2:i- For alignments of length 
10,000n? on four taxa we can say that the method is quite accurate (see also Tab. VI- VIII in the 
Supp. mat. C). 

For the K81* model we summarized the results on variances for each edge as the mean 
of combined variances of all samples (see formula (Q). The results are analogous to those of the 
JC69* model, see Figure II in Supp. mat. C. As expected, the parameter estimates are less dispersed 
for shorter branches and longer alignments (see Tab. IX-XI in the Supp. mat. C). 

Error in the branch lengths 

Using the formula (|5]) we calculated the actual difference Iq — 1 between the branch length Zq 
computed from the original parameters and the branch length / computed using their MLEs 
t,^ . Negative values of this score imply overestimation of the branch length, while positive values 
indicate underestimation. The results are shown in Figures 5 and 6. 

In the case of JC69* we observe that the method presented here does not tend to underestimate 
or overestimate the lengths for the depth 1 branches in all the 4-taxon trees (Zq — / is centered at 
(see Fig. 5). The depth 2 branches have a tendency towards overestimation of the length for 
branches longer than (approximately) 0.45 for Ti:2, 0.9 for T2:i, and 0.8 for the Tbaianced t^es. 
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In the latter case, lengths longer than 1 .2 for alignments up to 1 , OOOnt show opposite trend of 
underestimating the true lengths. The values were accurate when the alignment lengths were 
increased in the case of Ti:2 and T2:i- On the other hand, for ibalanced alignments of 10, OOOnt 
resulted in overestimation. 

In the K81* model the results are significantly more accurate (see Fig. 6). There is a trend 
of underestimation for branches longer than (approximately) 0.9 for shorter alignments. That is 
especially noticeable for ^balanced depth 1 branches of Ti:2. This trend diminishes with an 
increase in the alignment length. Overall, in the case of both models, the variance of the estimate 
is smaller for shorter lengths and both depth 1 and 2 branches of the Z2:i tree. 

In addition, we calculated the tree length (i.e. the sum of its branch lengths) from the 
estimated parameters and compared it to the theoretical result on the original branch length Iq: 
4.5/o for Ti:2 (where Iq is a depth 1 branch), 3/o for Z2:i Go for depth 2 branch) and 5/o for ibaianced- 
The rightmost columns of Figures 5 and 6 show the results for 4-taxon trees for the JC69* and 
K81* models respectively. The length of the tree is estimated accurately for all trees, the estimates 
being best for Z2:i- The variance is small and decreasing with an increase in the data length. As the 
sequences get longer, the distribution is centered around the true value. This is especially visible 
for the K81* model (see Fig. 6). 

Results for larger trees 

We run the analysis on the 6-taxon balanced tree, '^balanced' '^'^'^^^ K81* model, for alignment 
lengths of 1, OOOnt and 10, OOOnt and branch length 1 e {0.01,0.1,0.3,0.5,0.7,0.9, 1.1, 1.3, 1.4}. 
The p — values of the corresponding tests confirm that the performance of the algorithm is very 
satisfactory (see Tab. XII in the Supp. mat. C). We have seen in the 4-taxon study that the tree with 
equal branch lengths gave worse results than the unbalanced trees. Thus, we expect the results of 
the depth 2 branches to be similarly challenged in this case. 

Figure 7 depicts the estimated tree lengths. It can be seen that the estimates are accurate 
and the results improve for the alignments of 10, OOOnt. As expected, the variance of the estimates 
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increases with the increase in the length of the branch. By formula ([5]), long branches correspond 
to small values of the determinant of the transition matrix. Thus, statistical fluctuations in the 
parameter estimates have greater impact on the resulting length of the tree. 

Next, we calculated the difference between the original and estimated branch lengths. In 
Figure 8a we see that the depth 1 branches show some degree of underestimation of the length 
for lengths 1.1 — 1.4 and alignments of l,000nt. In the case of 10,000nt, the results improve and 
can be expected to show little bias for even longer data sets. Branches of depth 2 show higher 
degree of underestimation with improvement for longer data sets. The divergence of the original 
and estimated parameters for transition matrices given by formula (|7]) is shown in Figure 8b. For 
branches of depth 1 and data of length 10,000, the error is about 0.2. In the case of branches of 
depth 2, it is almost doubled for both alignment lengths. In both cases, branch lengths up to 0.5 
give satisfactory results. The error of the estimates for longer branches seems to be approaching a 
plateau. 

Combined variance of the estimated parameters is much decreased for the 10,000nt data 
sets in comparison with the l,000nt, and is smaller for the depth 1 branch (see Fig. 8c ). Again, 
the exponential shape of the plot can be attributed to the logarithm appearing in the formula (|5]). 

CONCLUSION 

In order to evaluate the performance of the method proposed here under various circumstances, we 
conducted many tests on simulated data sets. We observed that the performance of Empar is most 
optimal for long alignments and short branch lengths. 

It is worth noting that even for short alignments of 300nt or 500nt on 4 taxa, the estimated 
parameters approximate closely the original parameters in ^95% of the cases as proved by the 
normality test of the MLE. Moreover, the branch lengths calculated based on the parameters 
estimated by Empar were found very accurate already for short alignments. Though the measure 
of divergence D for the parameters of transition matrices proposed here accumulates all errors in 
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the entries of the transition matrix, alignment length of 10,000nt showed divergence values smaller 
than 0.1. 

In this paper, we provide the first implementation of a tool for inferring continuous parameters 
under the discrete-time models. The method allows for accurate estimation of branch lengths 
in non-homogeneous data. There are two limitations to applicability of the method. Firstly, 
the algorithm has an exponential computational time increase with the number of taxa. This 
is a restriction due to the fact that the algorithm computes large matrices of dimension that is 
exponential in the total number of nodes of a tree. Running time of Empar on star trees with 3-8 
nodes and equal branches of 0.5 on Ubuntu 11.10, Intel Core 17 920 at 2.67 GHz with 6 Gb is 
given in Table [2] Secondly, the memory usage of Empar is approx. 8 * 4l^(^)l and corresponds to 
the memory footprint of the matrix in the EM algorithm, e.g. for this matrix to fit in the memory of 
a 6Gb machine the bound on the number of nodes is |A^(t) | < 14. 

We conclude that Empar is a highly reliable method for estimating branch lengths of 
relatively small number of taxa and trees with short branch lengths (e.g. closely related species), 
and achieves high accuracy even when the results are based on short sequences. In particular, 
Empar is a reliable method to compute quartets and to be used with quartet-based methods (see 



Berry et al. (1999) and Berry and Gascuel (2000)) on nonhomogeneous data. 
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Tables 



Table 1: The relative frequency of p-value G (0.05,0.95) among the 1,000 tests based on the asymptotic 
normaUty of the maximum likelihood estimator under the JC69* model on the Ti:2 tree. The first column 
indicates the lengths of the depth 2 branch of Ti;2. The results are presented for both depths of the branches. 
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Table 2: Empar performance time- estimating the parameters of K81* on star trees with equal 

branch lengths of 0.5, varying number of leaves, L{t), for the MSAs of 1,000 and 10,000n?. 
length I n 3 4 5 6 7 8 

1,000 0.004 0.02 0.033 0.222 1,049 7.14 

10,000 0.011 0.043 0.171 1,044 6.95 
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Figure Captions 



Figure 1 . Expectation-maximization algorithm. 
Figure 2. Unrooted trees used for simulations: '^balanced' 

Ti:2, T2:i and 

^balanced {f^om left to right). 

Figure 3. Divergence Z)(^,|) between the parameters, t,, and their estimates, |, calculated by Empar. 
Horizontal axis: original length of the inner branch. 

Figure 4. Distribution of variances of the estimated parameters for different ahgnment lengths and different 
lengths of the depth 1 {leji) and depth 2 {right) branches under the JC69* model: Xi-2 {top), T2:i {middle), 

balanced (bottom). 

Figure 5. Error in the branch length estimation measured as the difference between the initial and the 
estimated branch lengths, /q — /, in the 1,000 simulated data sets along the T^aianced' '^1:2) '^2:1 trees under the 
JC69* model {left and middle columns). Rightmost column displays the distribution of the estimated length 
of the tree, where Iq labeling the horizontal axis corresponds to the length of the internal branch in T. 

Figure 6. Error in the branch length estimation under the K81* model (see Fig. 4 for details). 

Figure 7. Estimated tree length as a function of the initial length of a branch of ^^aianced 

(U = 9/o)in 1,000 

data sets generated under the K81* model. 

Figure 8. Results for the 1 , 000 data sets generated on the T^aianced ^^^^ ^^^* model. 
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Figures 



Require: model, 5^- phylogenetic tree, ud = {ux)\ex vector of counts. 

Initialize the values of the parameters 6 such that Px,y{d) > and choose a threshold e > 0. 
E-step: Define the expected complete data array U = {uxj)x€X,y€Y'- 

M-step: Compute the estimators 0* of by maximizing the Ukehhood function (1). 
if ^obs;uD{d*)-^obs{e;uD) > £ then 

set := 0* and return to the E-step 
else 

e := d* 
end if 

return a MLE of and the hkehhood of the observed model, Jfo},s{d; uu)- 



Figure 1 : Expectation-maximization algorithm. 
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JC69', Ti:2, depth 1 



ICr,r, 71,2, depth 2 



/fW, 7^,1, depth 1 



;f(i9", 75,1, depth 2 




(b) K81* 

Figure 3: Divergence Z)(^,|) between the parameters, ^, and their estimates, |, calculated by Empar. 
Horizontal axis: original length of the inner branch. 
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Figure 4: Distribution of variances of the estimated parameters for different alignment lengths and different 
lengths of the depth 1 {left) and depth 2 {right) branches under the JC69* model: Tia {top), Z2a {middle), 
balanced (bottom). 
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Figure 5: Error in the branch length estimation measured as the difference between the initial and the 
estimated branch lengths, /q — h in the 1,000 simulated data sets along the T^aian^ed' '^1:2) '^2:1 trees under the 
JC69* model (left and middle columns). Rightmost column displays the distribution of the estimated length 
of the tree, where Iq labeling the horizontal axis corresponds to the length of the internal branch in T. 
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Figure 6: Error in the branch length estimation under the K81* model (see Fig. 4 for details). 
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Figure 7: Estimated tree length as a function of the initial length of a branch of T^^ianced ^'-t = 9/o) in 1,000 
data sets generated under the K81* model. 
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(a) Error in the branch length estimation for distinct depths of the branches. 
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(b) The average L2 error between the original ((§) and estimated (|) parameters. 
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(c) Distribution of the combined variance for distinct depths of the branches. 

Figure 8: Results for the 1,000 data sets generated on the T^^ian^ed ^^^^ f^"" ^^^* model. 
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